Abstract

Metric and kernel learning are important in several machine learning applications. However,
most existing metric learning algorithms are limited to learning metrics over low-dimensional
data, while existing kernel learning algorithms are often limited to the transductive setting and
do not generalize to new data points. In this paper, we study metric learning as a problem of
learning a linear transformation of the input data. We show that for high-dimensional data,
a particular framework for learning a linear transformation of the data based on the LogDet
divergence can be efficiently kernelized to learn a metric (or equivalently, a kernel function) over
an arbitrarily high dimensional space. We further demonstrate that a wide class of convex loss
functions for learning linear transformations can similarly be kernelized, thereby considerably
expanding the potential applications of metric learning. We demonstrate our learning approach
by applying it to large-scale real world problems in computer vision and text mining.

1 Introduction

One of the basic requirements of many machine learning algorithms (e.g., semi-supervised clustering
algorithms, nearest neighbor classification algorithms) is the ability to compare two objects to
compute a similarity or distance between them. In many cases, off-the-shelf distance or similarity
functions such as the Euclidean distance or cosine similarity are used; for example, in text retrieval
applications, the cosine similarity is a standard function to compare two text documents. However,
such standard distance or similarity functions are not appropriate for all problems.

Recently, there has been significant effort focused on learning how to compare data objects. One
approach has been to learn a distance metric between objects given additional side information such
as pairwise similarity and dissimilarity constraints over the data.

One class of distance metrics that has shown excellent generalization properties is the Maha- 
lanobis distance function [DKJ+07, XNJR02, WBS05, GR05, SSSN04]. The Mahalanobis distance 
can be viewed as a method in which data is subject to a linear transformation, and then distances 
in this transformed space are computed via the standard squared Euclidean distance. Despite their 
simplicity and generalization ability, Mahalanobis distances suffer from two major drawbacks: 1) 
the number of parameters grows quadratically with the dimensionality of the data, making it dif- 
ficult to learn distance functions over high-dimensional data, 2) learning a linear transformation is 
inadequate for data sets with non-linear decision boundaries. 

To address the latter shortcoming, kernel learning algorithms typically attempt to learn a kernel 
matrix over the data. Limitations of linear methods can be overcome by employing a non-linear 
input kernel, which effectively maps the data non-linearly to a high-dimensional feature space. 
However, many existing kernel learning methods are still limited in that the learned kernels do not 
generalize to new points [KT03, KSD06, TRW05]. These methods are restricted to learning in the 
transductive setting where all the data (labelled and unlabeled) is assumed to be given upfront. 
There has been some work on learning kernels that generalize to new points, most notably work
on hyperkernels [OSW03], but the resulting optimization problems are expensive and cannot be 
scaled to large or even medium-sized data sets. 

In this paper, we explore metric learning with linear transformations over arbitrarily
high-dimensional spaces; as we will see, this is equivalent to learning a parameterized kernel function
$\phi(x)^T W \phi(y)$ given an input kernel function $\phi(x)^T \phi(y)$. In the first part of the paper, we focus
on a particular loss function called the LogDet divergence, for learning the positive definite matrix 
W. This loss function is advantageous for several reasons: it is defined only over positive defi- 
nite matrices, which makes the optimization simpler, as we will be able to effectively ignore the 
positive definiteness constraint on W. The loss function has precedent in optimization [Fle91]
and statistics [JS61]. An important advantage of our method is that the proposed optimization 
algorithm is scalable to very large data sets of the order of millions of data objects. But perhaps 
most importantly, the loss function permits efficient kernelization, allowing the learning of a lin- 
ear transformation in kernel space. As a result, unlike transductive kernel learning methods, our 
method easily handles out-of-sample extensions, i.e., it can be applied to unseen data. 

Later in the paper, we extend our result on kernelization of the LogDet formulation to other 
convex loss functions for learning W, and give conditions for which we are able to compute and 
evaluate the learned kernel functions. Our result is akin to the representer theorem for reproducing 
kernel Hilbert spaces, where the optimal parameters can be expressed purely in terms of the training 
data. In our case, even though the matrix W may be infinite-dimensional, it can be fully represented 
in terms of the constrained data points, making it possible to compute the learned kernel function 
value over arbitrary points. 

Finally, we apply our algorithm to a number of challenging learning problems, including ones 
from the domains of computer vision and text mining. Unlike existing techniques, we can learn 
linear transformation-based distance or kernel functions over these domains, and we show that the 
resulting functions lead to improvements over state-of-the-art techniques for a variety of problems. 

2 Related Work 

Most of the existing work in metric learning has been done in the Mahalanobis distance (or metric) 
learning paradigm, which has been found to be a sufficiently powerful class of metrics for a variety 
of different data. One of the earliest papers on metric learning [XNJR02] proposes a semidefinite 
programming formulation under similarity and dissimilarity constraints for learning a Mahalanobis 
distance, but the resulting formulation is slow to optimize and has been outperformed by more 
sophisticated techniques. More recently, [WBS05] formulate the metric learning problem in a large 
margin setting, with a focus on k-NN classification. They also formulate the problem as a
semidefinite programming problem and consequently solve it using a method that combines sub-gradient
descent and alternating projections. [GR05] proceed to learn a linear transformation in the fully su- 
pervised setting. Their formulation seeks to 'collapse classes' by constraining within-class distances 
to be zero while maximizing between-class distances. While each of these algorithms was shown 
to yield improved classification performance over the baseline metrics, their constraints do not 
generalize outside of their particular problem domains; in contrast, our approach allows arbitrary 
linear constraints on the Mahalanobis matrix. Furthermore, these algorithms all require eigenvalue 
decompositions or semi-definite programming, an operation that is cubic in the dimensionality of 
the data. 

Other notable work where the authors present methods for learning Mahalanobis metrics in- 
cludes [SSSN04] (online metric learning), Relevant Components Analysis (RCA) [SHWP02] (similar 
to discriminant analysis), locally-adaptive discriminative methods [HT96], and learning from
relative comparisons [SJ03]. In particular, the method of [SSSN04] provided the first demonstration of
Mahalanobis distance learning in kernel space. Their construction, however, is expensive to com- 
pute, requiring cubic time per iteration to update the parameters. As we will see, our LogDet-based 
algorithm can be implemented more efficiently. 

Non-linear transformation based metric learning methods have also been proposed, though these 
methods usually suffer from suboptimal performance, non-convexity, or computational complexity. 
Some example methods include neighborhood component analysis (NCA) [GRHS04] that learns a 
distance metric specifically for nearest-neighbor based classification; the convolutional neural net 
based method of [CHL05]; and a general Riemannian metric learning method [Leb06]. 

There have been several recent papers on kernel learning. As mentioned in the introduction, 
much of the research is limited to learning in the transductive setting, e.g. [KT03, KSD06, TRW05]. 
Research on kernel learning that does generalize to new data points includes multiple kernel
learning [LCB+04], where a linear combination of base kernel functions is learned; this approach has
proven to be useful for a variety of problems, such as object recognition in computer vision. Another 
approach to kernel learning is to use hyperkernels [OSW03], which consider functions between ker- 
nels, and learn in the appropriate reproducing kernel Hilbert space between such functions. In both 
cases, semidefinite programming is used, making the approach impractical for large-scale learning 
problems. Recently, some work has been done on making hyperkernel learning more efficient via 
second-order cone programming [TK06]; however, this formulation still cannot be applied to large
data sets. Concurrent to our work in showing kernelization for a wide class of convex loss functions, 
a recent paper considers kernelization of other Mahalanobis distance learning algorithms such as 
LMNN and NCA [CKTK08]. The latter paper, which appeared after the conference version of the 
results in our paper, presents a representer-type theorem and can be seen as complementary to the 
general kernelization results (see Section 4) we present in this paper. 

The research in this paper extends work done in [DKJ+07], [KSD06], and [DD08]. While the
focus in [DKJ+07] and [DD08] was solely on the LogDet divergence, in this work we characterize
kernelization of a wider class of convex loss functions. Furthermore, we provide a more detailed
analysis of kernelization for the Log Determinant loss, and include experimental results on
large-scale kernel learning. We extend the work in [KSD06] to the inductive setting; the main goal in
[KSD06] was to demonstrate the computational benefits of using the LogDet and von Neumann
divergences for learning low-rank kernel matrices. Finally, we do not consider online models for
metric and kernel learning in this paper; interested readers can refer to [JKDG08].

3 Metric and Kernel Learning via the LogDet Divergence 

In this section, we introduce the LogDet formulation for linearly transforming the data given 
a set of pairwise distance constraints. As discussed below, this is equivalent to a Mahalanobis 
metric learning problem. We then discuss kernelization issues of the formulation and present 
efficient optimization algorithms. Finally, we address limitations of the method when the amount 
of training data is large, and propose a modified algorithm to efficiently learn a kernel under such 
circumstances. 

3.1 Mahalanobis Distances and Parameterized Kernels 

First we introduce the framework for metric and kernel learning that is employed in this paper. 
Given a data set of objects $X = [x_1, \ldots, x_n]$, $x_i \in \mathbb{R}^d$ (when working in kernel space, the data
matrix will be represented as $X = [\phi(x_1), \ldots, \phi(x_n)]$, where $\phi$ is the mapping to feature space),
we are interested in finding an appropriate distance function to compare two objects. We consider
the Mahalanobis distance, parameterized by a positive definite matrix $W$; the squared distance
between two points $x_i$ and $x_j$ is given by

$$d_W(x_i, x_j) = (x_i - x_j)^T W (x_i - x_j).$$

This distance function can be viewed as learning a linear transformation of the data and measuring
the squared Euclidean distance in the transformed space. This is seen by factorizing the matrix
$W = G^T G$ and observing that $d_W(x_i, x_j) = \|G x_i - G x_j\|_2^2$. However, if the data is not linearly
separable in the input space, then the resulting distance function may not be powerful enough for
the desired application. As a result, we are interested in working in kernel space; that is, we can
express the Mahalanobis distance in kernel space after applying an appropriate mapping $\phi$ from
input to feature space:

$$d_W(x_i, x_j) = (\phi(x_i) - \phi(x_j))^T W (\phi(x_i) - \phi(x_j)).$$

As is standard with kernel-based algorithms, we require that this distance be computable given the
ability to compute the kernel function $\kappa_0(x, y) = \phi(x)^T \phi(y)$. We can therefore equivalently pose
the problem as learning a parameterized kernel function $\kappa(x, y) = \phi(x)^T W \phi(y)$ given some input
kernel function $\kappa_0(x, y) = \phi(x)^T \phi(y)$.
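
The two views of the distance described above can be checked numerically. The following is a minimal sketch, assuming NumPy and the identity feature map (so that $\phi(x) = x$); the function names `mahalanobis_sq` and `learned_kernel` are illustrative and do not come from the paper.

```python
import numpy as np

def mahalanobis_sq(W, x, y):
    """Squared Mahalanobis distance (x - y)^T W (x - y)."""
    diff = x - y
    return float(diff @ W @ diff)

def learned_kernel(W, x, y):
    """Parameterized kernel kappa(x, y) = x^T W y (identity feature map assumed)."""
    return float(x @ W @ y)

rng = np.random.default_rng(0)
d = 5
G = rng.standard_normal((d, d))   # an arbitrary linear transformation
W = G.T @ G                       # positive semidefinite by construction
x, y = rng.standard_normal(d), rng.standard_normal(d)

# Three ways to obtain the same squared distance.
dist_direct = mahalanobis_sq(W, x, y)
dist_transformed = float(np.sum((G @ x - G @ y) ** 2))
dist_kernel = (learned_kernel(W, x, x) + learned_kernel(W, y, y)
               - 2 * learned_kernel(W, x, y))
```

All three quantities agree up to rounding, which is why learning $W$ and learning the kernel $\kappa$ can be treated as the same problem.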

To learn the resulting metric/kernel, we assume that we are given constraints on the desired 
distance function. In this paper, we assume that pairwise similarity and dissimilarity constraints are 
given over the data — that is, pairs of points that should be similar under the learned metric/kernel, 
and pairs of points that should be dissimilar under the learned metric/kernel. Such constraints 
are natural in many settings; for example, given class labels over the data, points in the same 
class should be similar to one another and dissimilar to points in different classes. However, our 
approach is general and can accommodate other potential constraints over the distance function, 
such as relative distance constraints. 

The main challenge is in finding an appropriate loss function for learning the matrix W so that 
1) the resulting algorithm is scalable and efficiently computable in kernel space, 2) the resulting 
metric/kernel yields improved performance on the underlying machine learning problem, such as 
classification, semi-supervised clustering, etc. We now move on to the details.

3.2 LogDet Metric Learning 

The LogDet divergence between two positive definite matrices¹ $W, W_0 \in \mathbb{R}^{d \times d}$ is defined to be

$$D_{\ell d}(W, W_0) = \mathrm{tr}(W W_0^{-1}) - \log\det(W W_0^{-1}) - d.$$

We are interested in finding $W$ that is closest to $W_0$ as measured by the LogDet divergence but
that satisfies our desired constraints. When $W_0 = I$, this formulation can be interpreted as a
maximum entropy problem. Given a set of similarity constraints $\mathcal{S}$ and dissimilarity constraints $\mathcal{D}$,
we propose the following problem:

$$
\begin{aligned}
\min_{W \succeq 0} \quad & D_{\ell d}(W, I) \\
\text{s.t.} \quad & d_W(x_i, x_j) \le u, \quad (i,j) \in \mathcal{S}, \\
& d_W(x_i, x_j) \ge \ell, \quad (i,j) \in \mathcal{D}.
\end{aligned}
\tag{3.1}
$$

¹ The definition of LogDet divergence can be extended to the case when $W_0$ and $W$ are rank deficient by appropriate
use of the pseudo-inverse. The interested reader may refer to [KSD06].
The above problem was considered in [DKJ+07]. LogDet has many important properties that make
it useful for machine learning and optimization, including scale-invariance and preservation of the
range space. Please see [KSD08] for a detailed discussion on the properties of LogDet. Beyond this,
we prefer LogDet over other loss functions (including the squared Frobenius loss as used in [SSSN04]
or a linear objective as in [WBS05]) due to the fact that the resulting algorithm turns out to be
simple and efficiently kernelizable. We note that formulation (3.1) minimizes the LogDet divergence
to the identity matrix $I$. This can be generalized to arbitrary positive definite matrices $W_0$; however,
without loss of generality we can consider $W_0 = I$ since $D_{\ell d}(W, W_0) = D_{\ell d}(W_0^{-1/2} W W_0^{-1/2}, I)$.
Further, formulation (3.1) considers simple similarity and dissimilarity constraints over the learned
Mahalanobis distance, but other linear constraints are possible. Finally, the above formulation
assumes that there exists a feasible solution to the proposed optimization problem; extensions to
the infeasible case involving slack variables are discussed later (see Section 3.5).
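
To make the definition concrete, here is a small sketch of the LogDet divergence in NumPy, together with numerical checks of two properties mentioned in this section: the reduction to $W_0 = I$ and scale-invariance. The helpers `logdet_div`, `random_pd`, and `inv_sqrt` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def logdet_div(W, W0):
    """LogDet divergence tr(W W0^{-1}) - log det(W W0^{-1}) - d for positive definite W, W0."""
    d = W.shape[0]
    A = W @ np.linalg.inv(W0)
    _, logabsdet = np.linalg.slogdet(A)   # det(A) > 0 when W and W0 are positive definite
    return float(np.trace(A) - logabsdet - d)

def random_pd(d, rng):
    """Hypothetical helper: a random symmetric positive definite d x d matrix."""
    B = rng.standard_normal((d, d))
    return B @ B.T + d * np.eye(d)

def inv_sqrt(W0):
    """Symmetric inverse square root W0^{-1/2} via an eigendecomposition."""
    vals, vecs = np.linalg.eigh(W0)
    return (vecs / np.sqrt(vals)) @ vecs.T

rng = np.random.default_rng(1)
d = 4
W, W0 = random_pd(d, rng), random_pd(d, rng)

div = logdet_div(W, W0)
S = inv_sqrt(W0)
div_reduced = logdet_div(S @ W @ S, np.eye(d))   # D(W0^{-1/2} W W0^{-1/2}, I)
div_scaled = logdet_div(3.0 * W, 3.0 * W0)       # scale-invariance check
div_self = logdet_div(W, W)                      # divergence of a matrix to itself
```

Here `div`, `div_reduced`, and `div_scaled` coincide up to rounding, and `div_self` is numerically zero.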

3.3 Kernelizing the Problem 

We now consider the problem of kernelizing the metric learning problem. Subsequently, we will 
present an efficient algorithm and discuss generalization to new points. 

Given a set of $n$ constrained data points, let $K_0$ denote the input kernel matrix for the data, i.e.,
$K_0(i, j) = \kappa_0(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. Note that the squared Mahalanobis distance in kernel space
may be written as $d_W(\phi(x_i), \phi(x_j)) = K(i, i) + K(j, j) - 2K(i, j)$, where $K$ is the learned
kernel matrix; equivalently, we may write the squared distance as $\mathrm{tr}(K(e_i - e_j)(e_i - e_j)^T)$, where
$e_i$ is the $i$-th canonical basis vector. Consider the following problem to find $K$:

$$
\begin{aligned}
\min_{K \succeq 0} \quad & D_{\ell d}(K, K_0) \\
\text{s.t.} \quad & \mathrm{tr}(K(e_i - e_j)(e_i - e_j)^T) \le u, \quad (i,j) \in \mathcal{S}, \\
& \mathrm{tr}(K(e_i - e_j)(e_i - e_j)^T) \ge \ell, \quad (i,j) \in \mathcal{D}.
\end{aligned}
\tag{3.2}
$$

This kernel learning problem was first proposed in the transductive setting in [KSD06], though no
extensions to the inductive case were considered. Note that problem (3.1) optimizes over a $d \times d$
matrix $W$, while the kernel learning problem (3.2) optimizes over an $n \times n$ matrix $K$. We now
present our key theorem connecting problems (3.1) and (3.2).

Theorem 3.1. Let $W^*$ be the optimal solution to problem (3.1) and let $K^*$ be the optimal solution
to problem (3.2). Then the optimal solutions are related by the following:

$$K^* = X^T W^* X, \qquad W^* = I + X M X^T,$$

where $M = K_0^{-1}(K^* - K_0)K_0^{-1}$, $K_0 = X^T X$, $X = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]$.

To prove this theorem, we first prove a lemma for general Bregman matrix divergences, of which
the LogDet divergence is a special case. Consider the following general optimization problem:

$$
\begin{aligned}
\min_{W} \quad & D_\phi(W, W_0) \\
\text{s.t.} \quad & \mathrm{tr}(W R_i) \le s_i, \quad \forall\, 1 \le i \le m, \\
& W \succeq 0,
\end{aligned}
\tag{3.3}
$$

where $D_\phi(W, W_0)$ is a Bregman matrix divergence [KSD06] generated by a real-valued strictly
convex function over symmetric matrices $\phi : \mathbb{R}^{n \times n} \to \mathbb{R}$, i.e.,

$$D_\phi(W, W_0) = \phi(W) - \phi(W_0) - \mathrm{tr}((W - W_0)^T \nabla\phi(W_0)). \tag{3.4}$$

Note that the LogDet divergence is generated by $\phi(W) = -\log\det W$.
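
As an illustration of definition (3.4), the sketch below implements a generic Bregman matrix divergence in NumPy and checks that the generating function $\phi(W) = -\log\det W$ reproduces the LogDet divergence. The function `bregman_div` and the lambda names are assumptions made for this example; the squared Frobenius case is included only as a familiar point of comparison.

```python
import numpy as np

def bregman_div(phi, grad_phi, W, W0):
    """Bregman matrix divergence phi(W) - phi(W0) - tr((W - W0)^T grad_phi(W0)), as in (3.4)."""
    return float(phi(W) - phi(W0) - np.trace((W - W0).T @ grad_phi(W0)))

# Generating function of the LogDet divergence and its gradient (W symmetric).
neg_logdet = lambda W: -np.linalg.slogdet(W)[1]
grad_neg_logdet = lambda W: -np.linalg.inv(W)

# Generating function of the squared Frobenius loss, for comparison.
half_sq_frob = lambda W: 0.5 * np.sum(W * W)
grad_half_sq_frob = lambda W: W

rng = np.random.default_rng(2)
d = 4
B, B0 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
W, W0 = B @ B.T + d * np.eye(d), B0 @ B0.T + d * np.eye(d)

logdet_via_bregman = bregman_div(neg_logdet, grad_neg_logdet, W, W0)
A = W @ np.linalg.inv(W0)
logdet_direct = float(np.trace(A) - np.linalg.slogdet(A)[1] - d)
frob_via_bregman = bregman_div(half_sq_frob, grad_half_sq_frob, W, W0)
```

The value `logdet_via_bregman` matches the closed-form LogDet expression, while `frob_via_bregman` equals half the squared Frobenius distance between `W` and `W0`.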

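Lemma 3.2. The solution to the dual of the primal formulation (3.3) is given by:

$$
\begin{aligned}
\max \quad & \phi(W) - \phi(W_0) - \mathrm{tr}(W \nabla\phi(W)) + \mathrm{tr}(W_0 \nabla\phi(W_0)) - s(\lambda) \\
\text{s.t.} \quad & \nabla\phi(W) = \nabla\phi(W_0) - R(\lambda) + Z, & (3.5) \\
& \lambda \ge 0, \quad Z \succeq 0, & (3.6)
\end{aligned}
$$

where $s(\lambda) = \sum_{i=1}^m \lambda_i s_i$ and $R(\lambda) = \sum_{i=1}^m \lambda_i R_i$.

Proof. First, consider the Lagrangian of (3.3):

$$
L(W, \lambda, Z) = D_\phi(W, W_0) + \mathrm{tr}(W R(\lambda)) - s(\lambda) - \mathrm{tr}(W Z),
\quad \text{where } R(\lambda) = \sum_{i=1}^m \lambda_i R_i, \; s(\lambda) = \sum_{i=1}^m \lambda_i s_i, \; Z \succeq 0, \; \lambda \ge 0. \tag{3.7}
$$

Now, note that

$$\nabla_W D_\phi(W, W_0) = \nabla\phi(W) - \nabla\phi(W_0). \tag{3.8}$$

Setting the gradient of the Lagrangian with respect to $W$ to be zero and using (3.8), we get:

$$\nabla\phi(W) - \nabla\phi(W_0) + R(\lambda) - Z = 0, \tag{3.9}$$

$$\text{and so,} \quad \mathrm{tr}(W \nabla\phi(W_0)) = \mathrm{tr}(W \nabla\phi(W)) + \mathrm{tr}(W R(\lambda)) - \mathrm{tr}(W Z). \tag{3.10}$$

Now, substituting (3.10) into the Lagrangian, we get:

$$L(W, \lambda, Z) = \phi(W) - \phi(W_0) - \mathrm{tr}(W \nabla\phi(W)) + \mathrm{tr}(W_0 \nabla\phi(W_0)) - s(\lambda),$$

where $\nabla\phi(W) = \nabla\phi(W_0) - R(\lambda) + Z$. The lemma now follows directly. □

To prove Theorem 3.1, we will also need the following well-known lemma:

Lemma 3.3. $\det(I + AB) = \det(I + BA)$ for all $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times m}$.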

We are now ready to prove Theorem 3.1. 

Proof of Theorem 3.1. First we observe that the squared Mahalanobis distances from the
constraints in (3.1) may be written as

$$
d_W(x_i, x_j) = \mathrm{tr}(W(x_i - x_j)(x_i - x_j)^T) = \mathrm{tr}(W X (e_i - e_j)(e_i - e_j)^T X^T).
$$

The objective in problem (3.1), $D_{\ell d}(W, I)$, is defined only for positive definite $W$ and is a
convex function of $W$, hence using Slater's optimality condition, $Z = 0$ (in Lemma 3.2) and may
be removed from the constraints. Further, note that the LogDet divergence $D_{\ell d}(\cdot, \cdot)$ is a Bregman
matrix divergence with generating function $\phi(W) = -\log\det(W)$. Thus using $\nabla\phi(W) = -W^{-1}$
and Lemma 3.2, the dual of problem (3.1) is given by:

$$
\begin{aligned}
\min_{W, \lambda} \quad & \log\det W + b(\lambda) \\
\text{s.t.} \quad & W^{-1} = I + X C(\lambda) X^T, \\
& \lambda \ge 0,
\end{aligned}
\tag{3.11}
$$

where $C(\lambda) = \sum_{(i,j) \in \mathcal{S}} \lambda_{ij} (e_i - e_j)(e_i - e_j)^T - \sum_{(i,j) \in \mathcal{D}} \lambda_{ij} (e_i - e_j)(e_i - e_j)^T$ and
$b(\lambda) = \sum_{(i,j) \in \mathcal{S}} \lambda_{ij} u - \sum_{(i,j) \in \mathcal{D}} \lambda_{ij} \ell$.

Now, for matrices feasible for problem (3.11), $\log\det W = -\log\det W^{-1} = -\log\det(I +
X C(\lambda) X^T) = -\log\det(I + C(\lambda) K_0)$, where the last equality follows from Lemma 3.3 (recall that
$K_0 = X^T X$). Since $\log\det(AB) = \log\det A + \log\det B$ for square matrices $A$ and $B$, (3.11) may
be rewritten as

$$
\begin{aligned}
\min_{\lambda} \quad & -\log\det(K_0^{-1} + C(\lambda)) + b(\lambda), \\
\text{s.t.} \quad & \lambda \ge 0.
\end{aligned}
\tag{3.12}
$$

Writing $K^{-1} = K_0^{-1} + C(\lambda)$, the above can be written as:

$$
\begin{aligned}
\min_{K, \lambda} \quad & \log\det K + b(\lambda), \\
\text{s.t.} \quad & K^{-1} = K_0^{-1} + C(\lambda), \quad \lambda \ge 0.
\end{aligned}
\tag{3.13}
$$

The above problem can be seen by inspection to be identical to the dual problem of (3.2) as given by
Lemma 3.2. Hence, since their dual problems are identical, problems (3.1) and (3.2) are equivalent.
Using (3.11) and the Sherman-Morrison-Woodbury formula, the form of the optimal $W^*$ is:

$$W^* = I - X (C(\lambda^*)^{-1} + K_0)^{-1} X^T = I + X M X^T,$$

where $\lambda^*$ is the dual optimal and $M = -(C(\lambda^*)^{-1} + K_0)^{-1}$. Similarly, using (3.13), the optimal
$K^*$ is given by:

$$K^* = K_0 - K_0 (C(\lambda^*)^{-1} + K_0)^{-1} K_0 = X^T W^* X.$$

We can explicitly solve for $M$ as $M = K_0^{-1}(K^* - K_0)K_0^{-1}$ by simplification of these expressions
using the fact that $K_0 = X^T X$. This proves the theorem.

□
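
The algebraic relations used in the proof can be sanity-checked numerically. The sketch below (NumPy) picks an arbitrary, non-optimal choice of dual variables, forms $W$ and $K$ from the constraints in (3.11) and (3.13), and checks the relations of Theorem 3.1 along with the determinant identity of Lemma 3.3. The helper `pair` and the particular values in `C` are assumptions for illustration only; the sketch takes $d \ge n$ so that $K_0$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 6, 4
X = rng.standard_normal((d, n))      # columns play the role of phi(x_1), ..., phi(x_n)
K0 = X.T @ X                         # input kernel matrix (invertible here since d >= n)

def pair(i, j, n):
    """(e_i - e_j)(e_i - e_j)^T for canonical basis vectors e_i, e_j."""
    v = np.zeros(n)
    v[i], v[j] = 1.0, -1.0
    return np.outer(v, v)

# Arbitrary (not optimal) dual variables: one similar pair and one dissimilar pair.
C = 0.05 * pair(0, 1, n) - 0.02 * pair(2, 3, n)

W = np.linalg.inv(np.eye(d) + X @ C @ X.T)   # constraint in (3.11)
K = np.linalg.inv(np.linalg.inv(K0) + C)     # constraint in (3.13)

K0_inv = np.linalg.inv(K0)
M = K0_inv @ (K - K0) @ K0_inv

# Both sides of Lemma 3.3 applied to A = X C and B = X^T.
lemma_lhs = np.linalg.slogdet(np.eye(d) + X @ C @ X.T)[1]
lemma_rhs = np.linalg.slogdet(np.eye(n) + C @ K0)[1]
```

For any such `C`, the matrices satisfy `K = X.T @ W @ X` and `W = I + X @ M @ X.T` up to rounding, which is the structure the theorem asserts at the optimum.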

We now generalize the above theorem to regularize against arbitrary positive definite matrices
$W_0$.

Corollary 3.4. Consider the following problem:

$$
\begin{aligned}
\min_{W \succeq 0} \quad & D_{\ell d}(W, W_0) \\
\text{s.t.} \quad & d_W(x_i, x_j) \le u, \quad (i,j) \in \mathcal{S}, \\
& d_W(x_i, x_j) \ge \ell, \quad (i,j) \in \mathcal{D}.
\end{aligned}
\tag{3.14}
$$

Let $W^*$ be the optimal solution to problem (3.14) and let $K^*$ be the optimal solution to problem (3.2).
Then the optimal solutions are related by the following:

$$K^* = X^T W^* X, \qquad W^* = W_0 + W_0 X M X^T W_0,$$

where $M = K_0^{-1}(K^* - K_0)K_0^{-1}$, $K_0 = X^T W_0 X$, $X = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]$.
Proof. Note that $D_{\ell d}(W, W_0) = D_{\ell d}(W_0^{-1/2} W W_0^{-1/2}, I)$. Let $\tilde{W} = W_0^{-1/2} W W_0^{-1/2}$. Problem (3.14)
is now equivalent to:

$$
\begin{aligned}
\min_{\tilde{W} \succeq 0} \quad & D_{\ell d}(\tilde{W}, I) \\
\text{s.t.} \quad & d_{\tilde{W}}(\tilde{x}_i, \tilde{x}_j) \le u, \quad (i,j) \in \mathcal{S}, \\
& d_{\tilde{W}}(\tilde{x}_i, \tilde{x}_j) \ge \ell, \quad (i,j) \in \mathcal{D},
\end{aligned}
\tag{3.15}
$$

where $\tilde{W} = W_0^{-1/2} W W_0^{-1/2}$, $\tilde{X} = W_0^{1/2} X$ and $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n]$. Now using Theorem 3.1, the
optimal solution $\tilde{W}^*$ of problem (3.15) is related to the optimal $K^*$ of problem (3.2) by
$K^* = \tilde{X}^T \tilde{W}^* \tilde{X} = X^T W_0^{1/2} W_0^{-1/2} W^* W_0^{-1/2} W_0^{1/2} X = X^T W^* X$. Similarly,
$W^* = W_0^{1/2} \tilde{W}^* W_0^{1/2} = W_0 + W_0 X M X^T W_0$ where $M = K_0^{-1}(K^* - K_0)K_0^{-1}$. □

Since the kernelized version of LogDet metric learning can be posed as a linearly constrained
optimization problem with a LogDet objective, similar algorithms can be used to solve either
problem. This equivalence implies that we can implicitly solve the metric learning problem by
instead solving for the optimal kernel matrix $K^*$. Note that using the LogDet divergence as the
objective function has two significant benefits over many other popular loss functions: 1) the metric
and kernel learning problems (3.1), (3.2) are equivalent and hence solving the kernel learning
formulation directly provides an out-of-sample extension (see Section 3.4 for details), 2) projection
with respect to the LogDet divergence onto a single distance constraint has a closed form solution,
thus making it amenable to an efficient cyclic projection algorithm (refer to Section 3.5).

3.4 Generalizing to New Points

In this section, we see how to generalize to new points using the learned kernel matrix $K^*$.

Suppose that we have solved the kernel learning problem for $K^*$ (from now on, we will drop the
$*$ superscript and assume that $K$ and $W$ are at optimality). The distance between two points $\phi(x_i)$
and $\phi(x_j)$ that are in the training set can be computed directly from the learned kernel matrix
as $K(i, i) + K(j, j) - 2K(i, j)$. We now consider the problem of computing the learned distance
between two points $\phi(z_1)$ and $\phi(z_2)$ that may not be in the training set.

In Theorem 3.1, we showed that the optimal solution to the metric learning problem can be
expressed as $W = I + X M X^T$. To compute the Mahalanobis distance in kernel space, we see that
the inner product $\phi(z_1)^T W \phi(z_2)$ can be computed entirely via inner products between points:

$$
\begin{aligned}
\phi(z_1)^T W \phi(z_2) &= \phi(z_1)^T (I + X M X^T) \phi(z_2) \\
&= \phi(z_1)^T \phi(z_2) + \phi(z_1)^T X M X^T \phi(z_2) \\
&= \kappa_0(z_1, z_2) + k_1^T M k_2, \quad \text{where } k_i = [\kappa_0(z_i, x_1), \ldots, \kappa_0(z_i, x_n)]^T.
\end{aligned}
\tag{3.16}
$$

Thus, the expression above can be used to evaluate kernelized distances with respect to the learned
kernel function between arbitrary data objects.

In summary, the connection between kernel learning and metric learning allows us to generalize
our metrics to new points in kernel space. This is performed by first solving the kernel learning
problem for $K$, then using the learned kernel matrix and the input kernel function to compute
learned distances via (3.16).
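
The out-of-sample evaluation in (3.16) can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function names are hypothetical, a linear input kernel is assumed so that the result can be compared against an explicit $W$, and the "learned" kernel is a stand-in of the form $K = X^T W X$ rather than an actual optimum.

```python
import numpy as np

def kernel_matrix(kernel, A, B):
    """Matrix of kernel values between the columns of A and the columns of B."""
    return np.array([[kernel(a, b) for b in B.T] for a in A.T])

def learned_kernel_value(kernel, X, K0, K, z1, z2):
    """Evaluate kappa(z1, z2) = kappa_0(z1, z2) + k1^T M k2 as in (3.16)."""
    K0_inv = np.linalg.inv(K0)
    M = K0_inv @ (K - K0) @ K0_inv
    k1 = np.array([kernel(z1, x) for x in X.T])
    k2 = np.array([kernel(z2, x) for x in X.T])
    return float(kernel(z1, z2) + k1 @ M @ k2)

linear = lambda a, b: float(a @ b)    # linear input kernel, so phi is the identity map

rng = np.random.default_rng(4)
d, n = 6, 4
X = rng.standard_normal((d, n))
K0 = kernel_matrix(linear, X, X)

# Stand-in for a learned kernel: K = X^T W X for some W = I + X M X^T (not an optimum).
M_true = -0.01 * np.eye(n)
W = np.eye(d) + X @ M_true @ X.T
K = X.T @ W @ X

z1, z2 = rng.standard_normal(d), rng.standard_normal(d)   # points outside the training set
value = learned_kernel_value(linear, X, K0, K, z1, z2)
```

Only kernel evaluations against the training points are needed, so the same routine applies unchanged when `kernel` is a non-linear kernel and $W$ is never formed explicitly.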
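Algorithm 1 Metric/Kernel Learning with the LogDet Divergence

Input: $K_0$: input $n \times n$ kernel matrix, $\mathcal{S}$: set of similar pairs, $\mathcal{D}$: set of dissimilar pairs, $u, \ell$:
distance thresholds, $\gamma$: slack parameter
Output: $K$: output kernel matrix

1. $K \leftarrow K_0$, $\lambda_{ij} \leftarrow 0 \;\; \forall\, ij$
2. $\xi_{ij} \leftarrow u$ for $(i,j) \in \mathcal{S}$; otherwise $\xi_{ij} \leftarrow \ell$
3. repeat
   3.1. Pick a constraint $(i,j) \in \mathcal{S}$ or $\mathcal{D}$
   3.2. $p \leftarrow (e_i - e_j)^T K (e_i - e_j)$
   3.3. $\delta \leftarrow 1$ if $(i,j) \in \mathcal{S}$, $-1$ otherwise
   3.4. $\alpha \leftarrow \min\left(\lambda_{ij}, \frac{\delta\gamma}{\gamma+1}\left(\frac{1}{p} - \frac{1}{\xi_{ij}}\right)\right)$
   3.5. $\beta \leftarrow \delta\alpha / (1 - \delta\alpha p)$
   3.6. $\xi_{ij} \leftarrow \gamma\xi_{ij} / (\gamma + \delta\alpha\xi_{ij})$
   3.7. $\lambda_{ij} \leftarrow \lambda_{ij} - \alpha$
   3.8. $K \leftarrow K + \beta K (e_i - e_j)(e_i - e_j)^T K$
4. until convergence
return $K$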



3.5 Kernel Learning Algorithm 

Given the connection between the Mahalanobis metric learning problem for the d x d matrix W and 
the kernel learning problem for the n x n kernel matrix K, we would like to develop an algorithm 
for efficiently performing metric learning in kernel space. Specifically, we provide an algorithm (see 
Algorithm 1) for solving the kernelized LogDet metric learning problem, as given in (3.2). 

First, to avoid problems with infeasibility, we incorporate slack variables into our formulation.
These provide a tradeoff between minimizing the divergence between $K$ and $K_0$ and satisfying the
constraints. Note that our earlier results (see Theorem 3.1) easily generalize to the slack case:

$$
\begin{aligned}
\min_{K, \xi} \quad & D_{\ell d}(K, K_0) + \gamma \cdot D_{\ell d}(\mathrm{diag}(\xi), \mathrm{diag}(\xi_0)) \\
\text{s.t.} \quad & \mathrm{tr}(K(e_i - e_j)(e_i - e_j)^T) \le \xi_{ij}, \quad (i,j) \in \mathcal{S}, \\
& \mathrm{tr}(K(e_i - e_j)(e_i - e_j)^T) \ge \xi_{ij}, \quad (i,j) \in \mathcal{D}.
\end{aligned}
\tag{3.17}
$$

The parameter $\gamma$ above controls the tradeoff between satisfying the constraints and minimizing
$D_{\ell d}(K, K_0)$, and the entries of $\xi_0$ are set to be $u$ for corresponding similarity constraints and $\ell$ for
dissimilarity constraints.

To solve problem (3.17), we employ the technique of Bregman projections, as discussed in the
transductive setting [KSD06, KSD08]. At each iteration, we choose a constraint $(i,j)$ from $\mathcal{S}$ or
$\mathcal{D}$. We then apply a Bregman projection such that $K$ satisfies the constraint after projection; note
that the projection is not an orthogonal projection but is rather tailored to the particular function
that we are optimizing. Algorithm 1 details the steps for Bregman's method on this optimization
problem. Each update is given by a rank-one update

$$K \leftarrow K + \beta K (e_i - e_j)(e_i - e_j)^T K,$$

where $\beta$ is an appropriate projection parameter that can be computed in closed form (see
Algorithm 1).

Algorithm 1 has a number of key properties which make it useful for various kernel learning
tasks. First, the Bregman projections can be computed in closed form, assuring that the projection
updates are efficient ($O(n^2)$). Note that, if the feature space dimensionality $d$ is less than $n$ then a
similar algorithm can be used directly in the feature space (see [DKJ+07]). Instead of LogDet, if we
use the von Neumann divergence, another potential loss function for this problem, $O(n^2)$ updates
are possible, but are much more complicated and require use of the fast multipole method, which
cannot be employed easily in practice. Secondly, the projections maintain positive definiteness,
which avoids any eigenvector computation or semidefinite programming. This is in stark contrast
with the Frobenius loss, which requires additional computation to maintain positive definiteness,
leading to $O(n^3)$ updates.
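
The cyclic projection scheme described above might be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code: the function name, the fixed number of sweeps used in place of a convergence test, and the projection step size (derived here from the slack formulation (3.17) so that the updated distance equals the updated slack value) are assumptions of this sketch.

```python
import numpy as np

def logdet_kernel_learning(K0, similar, dissimilar, u, ell, gamma=1.0, sweeps=50):
    """Cyclic Bregman projections for the slack formulation (3.17)."""
    K = K0.astype(float).copy()
    constraints = ([(i, j, 1.0) for (i, j) in similar]
                   + [(i, j, -1.0) for (i, j) in dissimilar])
    lam = {(i, j): 0.0 for (i, j, _) in constraints}                       # dual variables
    xi = {(i, j): (u if delta > 0 else ell) for (i, j, delta) in constraints}
    for _ in range(sweeps):              # fixed sweep count stands in for a convergence test
        for (i, j, delta) in constraints:
            p = K[i, i] + K[j, j] - 2.0 * K[i, j]      # current squared distance
            alpha = min(lam[i, j],
                        delta * gamma / (gamma + 1.0) * (1.0 / p - 1.0 / xi[i, j]))
            beta = delta * alpha / (1.0 - delta * alpha * p)
            xi[i, j] = gamma * xi[i, j] / (gamma + delta * alpha * xi[i, j])
            lam[i, j] -= alpha
            Kv = K[:, i] - K[:, j]                     # K (e_i - e_j)
            K = K + beta * np.outer(Kv, Kv)            # rank-one update, O(n^2) work
    return K

# Usage on a small random positive definite kernel matrix (illustrative values only).
rng = np.random.default_rng(7)
X = rng.standard_normal((6, 5))
K0 = X.T @ X + 1e-3 * np.eye(5)
d01 = K0[0, 0] + K0[1, 1] - 2 * K0[0, 1]
d23 = K0[2, 2] + K0[3, 3] - 2 * K0[2, 3]
K_learned = logdet_kernel_learning(K0, [(0, 1)], [(2, 3)], u=0.5 * d01, ell=2.0 * d23)
```

Each inner step touches the full matrix once through a rank-one update and never computes an eigendecomposition, which reflects the two properties emphasized in this section.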

3.6 Metric/Kernel Learning with Large Datasets 

In Sections 3.2 and 3.3 we proposed a LogDet divergence based Mahalanobis metric learning
problem (3.1) and an equivalent kernel learning problem (3.2). The number of parameters involved in
these problems is $O(\min(n^2, d^2))$, where $n$ is the number of training points and $d$ is the
dimensionality of the data. This quadratic dependency affects not only the running time for both training
and testing, but also poses tremendous challenges in estimating a quadratic number of parameters.
For example, a data set with 10,000 dimensions leads to a Mahalanobis matrix with 100 million
values. This represents a fundamental limitation of existing approaches, as many modern data
mining problems possess relatively high dimensionality.

In this section, we present a method for learning structured Mahalanobis distance (kernel)
functions that scale linearly with the dimensionality (or training set size). Instead of representing
the Mahalanobis distance/kernel matrix as a full $d \times d$ (or $n \times n$) matrix with $O(\min(n^2, d^2))$
parameters, our methods use compressed representations, admitting matrices parameterized by
$O(\min(n, d))$ values. This enables the Mahalanobis distance/kernel function to be learned, stored,
and evaluated efficiently in the context of high dimensionality and large training set size. In
particular, we propose a method to efficiently learn an identity plus low-rank Mahalanobis distance
matrix and its equivalent kernel function.

Now, we formulate the high-dimensional identity plus low-rank (IPLR) metric learning problem.
Consider a low-dimensional subspace in $\mathbb{R}^d$ and let the columns of $U$ form an orthonormal basis of
this subspace. We will constrain the learned Mahalanobis distance matrix to be of the form:

$$W = I_d + W_l = I_d + U L U^T, \tag{3.18}$$

where $I_d$ is the $d \times d$ identity matrix, $W_l$ denotes the low-rank part of $W$ and $L \in \mathbb{S}_+^{k \times k}$ with
$k \ll \min(n, d)$. Analogous to (3.1), we propose the following problem to learn an identity plus
low-rank Mahalanobis distance function:

$$
\begin{aligned}
\min_{W \succeq 0} \quad & D_{\ell d}(W, I_d) \\
\text{s.t.} \quad & d_W(x_i, x_j) \le u, \quad (i,j) \in \mathcal{S}, \\
& d_W(x_i, x_j) \ge \ell, \quad (i,j) \in \mathcal{D}, \\
& W = I_d + U L U^T.
\end{aligned}
\tag{3.19}
$$

Note that the above problem is identical to (3.1) except for the added constraint $W = I_d + U L U^T$.
Let $F = I_k + L$. Now we have

$$
\begin{aligned}
D_{\ell d}(W, I_d) &= \mathrm{tr}(I_d + U L U^T) - \log\det(I_d + U L U^T) - d \\
&= \mathrm{tr}(I_k + L) + d - k - \log\det(I_k + L) - d \\
&= D_{\ell d}(F, I_k),
\end{aligned}
\tag{3.20}
$$
where the second equality follows from the fact that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ and Lemma 3.3. Also note
that for all $C \in \mathbb{R}^{n \times n}$,

$$
\begin{aligned}
\mathrm{tr}(W X C X^T) &= \mathrm{tr}((I_d + U L U^T) X C X^T) \\
&= \mathrm{tr}(X C X^T) + \mathrm{tr}(L U^T X C X^T U) \\
&= \mathrm{tr}(X C X^T) - \mathrm{tr}(X' C X'^T) + \mathrm{tr}(F X' C X'^T),
\end{aligned}
$$

where $X' = U^T X$ is the reduced-dimensional representation of $X$. Hence,

$$d_W(x_i, x_j) = \mathrm{tr}(W X (e_i - e_j)(e_i - e_j)^T X^T) = d_I(x_i, x_j) - d_I(x'_i, x'_j) + d_F(x'_i, x'_j). \tag{3.21}$$

Using (3.20) and (3.21), problem (3.19) is equivalent to the following:

$$
\begin{aligned}
\min_{F \succeq 0} \quad & D_{\ell d}(F, I_k) \\
\text{s.t.} \quad & d_F(x'_i, x'_j) \le u - d_I(x_i, x_j) + d_I(x'_i, x'_j), \quad (i,j) \in \mathcal{S}, \\
& d_F(x'_i, x'_j) \ge \ell - d_I(x_i, x_j) + d_I(x'_i, x'_j), \quad (i,j) \in \mathcal{D}.
\end{aligned}
\tag{3.22}
$$

Note that the above formulation is an instance of problem (3.1) and can be solved using an algorithm
similar to Algorithm 1. Furthermore, the above problem solves for a $k \times k$ matrix rather than a $d \times d$
matrix seemingly required by (3.19). The optimal $W^*$ is obtained as $W^* = I_d + U(F^* - I_k)U^T$.
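
The two identities that drive this reduction, (3.20) and (3.21), can be verified numerically. The sketch below assumes NumPy, a randomly drawn orthonormal basis `U`, and an arbitrary positive semidefinite `L`; the helper names are illustrative and not taken from the paper.

```python
import numpy as np

def logdet_div_to_identity(A):
    """LogDet divergence D(A, I) = tr(A) - log det(A) - dim(A)."""
    return float(np.trace(A) - np.linalg.slogdet(A)[1] - A.shape[0])

def sq_dist(W, a, b):
    """Squared Mahalanobis distance (a - b)^T W (a - b)."""
    return float((a - b) @ W @ (a - b))

rng = np.random.default_rng(5)
d, n, k = 50, 8, 3
X = rng.standard_normal((d, n))
U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal basis of a random subspace
B = rng.standard_normal((k, k))
L = B @ B.T                                        # an arbitrary positive semidefinite L
F = np.eye(k) + L
W = np.eye(d) + U @ L @ U.T                        # identity plus low-rank metric (3.18)

X_red = U.T @ X                                    # reduced representation X' = U^T X
i, j = 0, 1
lhs = sq_dist(W, X[:, i], X[:, j])
rhs = (sq_dist(np.eye(d), X[:, i], X[:, j])
       - sq_dist(np.eye(k), X_red[:, i], X_red[:, j])
       + sq_dist(F, X_red[:, i], X_red[:, j]))
div_full = logdet_div_to_identity(W)               # objective over the d x d matrix
div_small = logdet_div_to_identity(F)              # objective over the k x k matrix
```

Both the objective and the distances computed with the full $d \times d$ matrix agree with their $k \times k$ counterparts, so the optimization can work entirely with `F` and `X_red`.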

Next, we show that problem (3.22), and equivalently (3.19), can be solved efficiently in feature
space by selecting an appropriate basis $R$ (with $U = R(R^T R)^{-1/2}$). Let $R = XJ$, where
$J \in \mathbb{R}^{n \times k}$. Note that $U = XJ(J^T K_0 J)^{-1/2}$ and
$X' = U^T X = (J^T K_0 J)^{-1/2} J^T K_0$, i.e., $X' \in \mathbb{R}^{k \times n}$ can be computed
efficiently in the feature space (requiring inversion of only a $k \times k$ matrix). Hence, problem (3.22)
can be solved efficiently in feature space using Algorithm 1 and the optimal kernel $K^*$ is given by

$$K^* = X^T W^* X = K_0 + K_0 J (J^T K_0 J)^{-1/2} (F^* - I^k) (J^T K_0 J)^{-1/2} J^T K_0.$$
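As an illustration of the kernel-only computation, the following NumPy sketch forms $X'$ and $K^*$ from $K_0$, $J$, and a given $F$. The helper names are hypothetical and $F$ is taken as given rather than learned by Algorithm 1; the sketch only demonstrates the two formulas above.

```python
import numpy as np

def inv_sqrt_psd(M):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs / np.sqrt(vals)) @ vecs.T

def reduced_representation(K0, J):
    """X' = (J^T K0 J)^{-1/2} J^T K0, computed from the kernel matrix only."""
    return inv_sqrt_psd(J.T @ K0 @ J) @ J.T @ K0

def learned_kernel(K0, J, F):
    """K* = K0 + X'^T (F - I) X', which expands to the expression for K* above."""
    Xr = reduced_representation(K0, J)
    k = F.shape[0]
    return K0 + Xr.T @ (F - np.eye(k)) @ Xr
```

Only a $k \times k$ eigendecomposition is needed, which matches the remark that a single $k \times k$ inversion suffices.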

Note that problem (3.22) can be solved via Algorithm 1 using $O(k^2)$ computational steps per
iteration. Additionally, $O(\min(n, d)k)$ steps are required to prepare the data. Also, the optimal
solution $W^*$ (or $K^*$) can be stored implicitly using $O(\min(n, d)k)$ memory and, similarly, the
Mahalanobis distance between any two points can be computed in $O(\min(n, d)k)$ time.

The metric learning problem presented here depends critically on the basis selected. For the 
case when d is not significantly larger than n and feature space vectors X are available explicitly, 
the basis R can be selected by using one of the following heuristics (see Section 5, [DD08] for more 
details): 

• Using the top k singular vectors of X. 

• Clustering the columns of X and using the mean vectors as the basis R. 

• For the fully-supervised case, if the number of classes ($c$) is greater than the required
dimensionality ($k$) then cluster the class-mean vectors into $k$ clusters and use the obtained cluster
centers for forming the basis $R$. If $c < k$ then cluster each class into $k/c$ clusters and use the
cluster centers to form $R$.

For learning the kernel function, the basis $R = XJ$ can be selected by: 1) using a randomly
sampled coefficient matrix $J$, 2) clustering $X$ using kernel $k$-means or a spectral clustering method,
3) choosing a random subset of $X$, i.e., the columns of $J$ are random indicator vectors. A more
careful selection of the basis $R$ should further improve accuracy of our method and is left as a topic
for future research.
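Options 1) and 3) above can be sketched in a few lines of NumPy. The function names are hypothetical, and the Gaussian distribution for the random coefficients is an assumption, since the text does not specify how $J$ is sampled. Option 2) requires a kernel clustering routine and is not sketched here.

```python
import numpy as np

def random_coefficient_J(n, k, rng):
    """Option 1: a randomly sampled n x k coefficient matrix (Gaussian entries assumed)."""
    return rng.standard_normal((n, k))

def random_subset_J(n, k, rng):
    """Option 3: columns of J are indicator vectors of k distinct random points."""
    idx = rng.choice(n, size=k, replace=False)
    J = np.zeros((n, k))
    J[idx, np.arange(k)] = 1.0
    return J, idx
```

With the indicator construction, $R = XJ$ simply selects the chosen columns of $X$, so $J^T K_0 J$ is the corresponding $k \times k$ submatrix of $K_0$.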
4 Kernelization with Other Convex Loss Functions



One of the key benefits to using the LogDet divergence for metric learning is its ability to efficiently 
learn a linear mapping for high-dimensional kernelized data. A natural question is whether one 
can kernelize metric learning with other loss functions, such as those considered previously in the 
literature. To this end, the work of [CKTK08] showed how to kernelize some popular metric learning 
algorithms such as MCML [GR05] and LMNN [WBS05]. In this section, we show a complementary
result that shows how to kernelize a class of metric learning algorithms that learns a linear map in 
input or feature space. 

Consider the following (more) general optimization problem that may be viewed as a generalization
of (3.1) for learning a linear transformation matrix $G$, where $W = G^T G$:

$$\begin{aligned}
\min \quad & \operatorname{tr}(f(W)) \\
\text{s.t.} \quad & \operatorname{tr}(W X C_i X^T) \le b_i, \quad \forall 1 \le i \le m, \\
& W \succeq 0,
\end{aligned} \tag{4.1}$$

where $f : \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}$, $\operatorname{tr}(f(W))$ is a convex function,
$W \in \mathbb{S}_+^{d \times d}$, $X \in \mathbb{R}^{d \times n}$, and each $C_i \in \mathbb{R}^{n \times n}$
is a symmetric matrix. Note that we have generalized both the loss function and the constraints. For
example, the LogDet divergence can be viewed as a special case, since we may write
$D_{\ell d}(X, Y) = \operatorname{tr}(XY^{-1} - \log(XY^{-1}) - I)$. The loss function $f(W)$ regularizes the
learned transformation $W$ against the baseline Euclidean distance metric, i.e., $W_0 = I$. Hence, a
desirable property of $f$ would be: $\operatorname{tr}(f(W)) \ge 0$ with $\operatorname{tr}(f(W)) = 0$ iff $W = I$.

In this section we show that for a large and important class of functions $f$, problem (4.1) can be
solved for $W$ implicitly in the feature space, i.e., the problem (4.1) is kernelizable. We assume that
the kernel function $K_0(x, y) = \phi(x)^T \phi(y)$ between any two data points can be computed in $O(1)$
time. Denote $W^*$ as an optimal solution for (4.1). Now, we formally define kernelizable metric
learning problems.

Definition 4.1. An instance of metric learning problem (4.1) is kernelizable if the following
conditions hold:

• Problem (4.1) is solvable efficiently in time poly($n$, $m$) without explicit use of feature space
vectors $X$.

• $\operatorname{tr}(W^* Y C Y^T)$, where $Y \in \mathbb{R}^{d \times N}$ is the feature space representation of
any given data points, can be computed in time poly($N$) for all $C \in \mathbb{R}^{N \times N}$.



Theorem 4.2. Let $f : \mathbb{R} \to \mathbb{R}$ be a function defined over the reals such that:

• $f(x)$ is a convex function.

• A sub-gradient of $f(x)$ can be computed efficiently in $O(1)$ time.

• $f(x) \ge 0$ $\forall x$ with $f(\eta) = 0$ for some $\eta \ge 0$.

Consider the extension of $f$ to the spectrum of $W \in \mathbb{S}_+^d$, i.e., $f(W) = U f(\Lambda) U^T$, where
$W = U \Lambda U^T$ is the eigenvalue decomposition of $W$ (Definition 1.2, [Hig08]). Assuming $X$ to be
full-rank, i.e., $K_0 = X^T X$ is invertible, problem (4.1) is kernelizable (Definition 4.1).
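To make the spectral extension concrete, the following NumPy sketch evaluates $\operatorname{tr}(f(W))$ by applying a scalar function to the eigenvalues of $W$. The scalar function $f(x) = x - \log x - 1$ is the one implied by writing the LogDet divergence to the identity as $\operatorname{tr}(W - \log W - I)$; the function names are illustrative assumptions.

```python
import numpy as np

def trace_spectral(f, W):
    """tr(f(W)) for symmetric W: the sum of f over the eigenvalues of W."""
    return float(np.sum(f(np.linalg.eigvalsh(W))))

def f_logdet(x):
    """Scalar function whose spectral extension gives D_ld(W, I); f(1) = 0."""
    return x - np.log(x) - 1.0
```

For a positive definite $W$, `trace_spectral(f_logdet, W)` should match $\operatorname{tr}(W) - \log\det W - d$, and it vanishes at $W = I$, in line with the third condition of the theorem with $\eta = 1$.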

To prove the above theorem, we need the following two lemmas:

Lemma 4.3. Assuming $f$ satisfies the conditions stated in Theorem 4.2 and $X$ is full-rank,
$\exists S^* \in \mathbb{R}^{n \times n}$ such that $W^* = \eta I^d + X S^* X^T$ is an optimal solution to (4.1).

Proof. Let $W = U \Lambda U^T = \sum_j \lambda_j u_j u_j^T$ be the eigenvalue decomposition of $W$, where
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$. Consider a linear constraint
$\operatorname{tr}(W X C_i X^T) \le b_i$ as specified in problem (4.1). Note that
$\operatorname{tr}(W X C_i X^T) = \sum_j \lambda_j u_j^T X C_i X^T u_j$. Note that if the $j$-th eigenvector $u_j$
of $W$ is orthogonal to the range space of $X$, i.e., $X^T u_j = 0$, then the corresponding eigenvalue
$\lambda_j$ is not constrained (except for the non-negativity constraint imposed by the positive
semi-definiteness constraint). Since the range space of $X$ is at most $n$-dimensional, without loss of
generality we can assume that $\lambda_j \ge 0$, $\forall j > n$, are not constrained by the linear
inequality constraints in (4.1).

Furthermore, by the definition of a spectral function (Definition 1.2, [Hig08]),
$\operatorname{tr}(f(W)) = \sum_j f(\lambda_j)$. Since $f$ satisfies the conditions of Theorem 4.2,
$f(\eta) = \min_x f(x) = 0$. In order to minimize $\operatorname{tr}(f(W))$, we can select
$\lambda_j^* = \eta \ge 0$, $\forall j > n$ (note that the non-negativity constraint is satisfied for this
choice of $\lambda_j$). Furthermore, eigenvectors $u_j^*$, $\forall j \le n$, lie in the range space of $X$,
i.e., $\forall j \le n$, $u_j^* = X \alpha_j^*$ for some $\alpha_j^* \in \mathbb{R}^n$. Hence,

$$\begin{aligned}
W^* &= \sum_{j=1}^{n} \lambda_j^* u_j^* u_j^{*T} + \eta \sum_{j=n+1}^{d} u_j^* u_j^{*T}, \\
&= \sum_{j=1}^{n} (\lambda_j^* - \eta) u_j^* u_j^{*T} + \eta \sum_{j=1}^{d} u_j^* u_j^{*T}, \\
&= X \left( \sum_{j=1}^{n} (\lambda_j^* - \eta) \alpha_j^* \alpha_j^{*T} \right) X^T + \eta I^d, \\
&= X S^* X^T + \eta I^d,
\end{aligned}$$

where $S^* = \sum_{j=1}^{n} (\lambda_j^* - \eta) \alpha_j^* \alpha_j^{*T}$. □
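Lemma 4.4. If $n < d$ and $X \in \mathbb{R}^{d \times n}$ has full column rank, i.e., $X^T X$ is invertible, then:

$$X S X^T \succeq 0 \iff S \succeq 0.$$

Proof.

($\Rightarrow$) $X S X^T \succeq 0 \implies v^T X S X^T v \ge 0$, $\forall v \in \mathbb{R}^d$. Since $X$ has full
column rank, $\forall q \in \mathbb{R}^n$ $\exists v \in \mathbb{R}^d$ s.t. $X^T v = q$. Hence,
$q^T S q = v^T X S X^T v \ge 0$, $\forall q \in \mathbb{R}^n$ $\implies S \succeq 0$.

($\Leftarrow$) Now $\forall v \in \mathbb{R}^d$, $v^T X S X^T v \ge 0$ as $S \succeq 0$. Thus $X S X^T \succeq 0$. □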

We now present a proof of Theorem 4.2. The key idea is to prove that (4.1) can be solved implicitly
by solving for $S^*$ of Lemma 4.3.

Proof. [Theorem 4.2]

Using Lemma 4.3, $W^*$ is of the form $W^* = \eta I^d + X S^* X^T$. Assuming $X$ is full-rank, i.e., all the
data points $x_i$ are linearly independent, there is a one-to-one mapping between $W^*$ and $S^*$. Hence,
solving for $W^*$ is equivalent to solving for $S^*$. So, now our goal is to reformulate problem (4.1) in
terms of $S^*$.

Let $X = U_X \Sigma_X V_X^T$ be the SVD of $X$. Then,

$$\begin{aligned}
W &= \eta I^d + X S X^T, \\
&= \eta I^d + U_X \Sigma_X V_X^T S V_X \Sigma_X U_X^T, \\
&= \begin{bmatrix} U_X & U_\perp \end{bmatrix}
\begin{bmatrix} \Sigma_X V_X^T S V_X \Sigma_X + \eta I^n & 0 \\ 0 & \eta I^{d-n} \end{bmatrix}
\begin{bmatrix} U_X^T \\ U_\perp^T \end{bmatrix},
\end{aligned} \tag{4.2}$$

where $U_\perp^T U_X = 0$.

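Now, consider $f(W) = f(\eta I^d + X S X^T)$. Using (4.2):

$$\begin{aligned}
f(W) &= f(\eta I^d + X S X^T), \\
&= f\left( \begin{bmatrix} U_X & U_\perp \end{bmatrix}
\begin{bmatrix} \Sigma_X V_X^T S V_X \Sigma_X + \eta I^n & 0 \\ 0 & \eta I^{d-n} \end{bmatrix}
\begin{bmatrix} U_X^T \\ U_\perp^T \end{bmatrix} \right), \\
&= \begin{bmatrix} U_X & U_\perp \end{bmatrix}
f\left( \begin{bmatrix} \Sigma_X V_X^T S V_X \Sigma_X + \eta I^n & 0 \\ 0 & \eta I^{d-n} \end{bmatrix} \right)
\begin{bmatrix} U_X^T \\ U_\perp^T \end{bmatrix}, \\
&= \begin{bmatrix} U_X & U_\perp \end{bmatrix}
\begin{bmatrix} f(\Sigma_X V_X^T S V_X \Sigma_X + \eta I^n) & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} U_X^T \\ U_\perp^T \end{bmatrix}, \\
&= U_X f(\Sigma_X V_X^T S V_X \Sigma_X + \eta I^n) U_X^T,
\end{aligned}$$

where we first use the property that $f(Q Z Q^T) = Q f(Z) Q^T$ for an orthogonal $Q$ and a spectral
function $f$, and then the property that
$f\left( \begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix} \right) = \begin{bmatrix} f(A) & 0 \\ 0 & f(B) \end{bmatrix}$
together with the fact that $f(\eta) = 0$. Hence,

$$\operatorname{tr}(f(W)) = \operatorname{tr}\left( f(\Sigma_X V_X^T S V_X \Sigma_X + \eta I^n) \right). \tag{4.3}$$

Next, consider the constraint $\operatorname{tr}(W X C_i X^T) \le b_i$. Note that

$$\operatorname{tr}(W X C_i X^T) = \operatorname{tr}((\eta I^d + X S X^T) X C_i X^T) = \operatorname{tr}(\eta C_i K_0 + C_i K_0 S K_0). \tag{4.4}$$

Hence, the constraint $\operatorname{tr}(W X C_i X^T) \le b_i$ reduces to:

$$\operatorname{tr}(\eta C_i K_0 + C_i K_0 S K_0) \le b_i. \tag{4.5}$$

Finally, consider the constraint $W \succeq 0$. Using (4.2), we see that this is equivalent to:

$$\eta I^n + \Sigma_X V_X^T S V_X \Sigma_X \succeq 0 \iff S \succeq -\eta K_0^{-1}, \tag{4.6}$$

where $K_0 = X^T X = V_X \Sigma_X^2 V_X^T$.

Using (4.3), (4.5), and (4.6) we get the following problem which is equivalent to (4.1):

$$\begin{aligned}
\min_{S} \quad & \operatorname{tr}\left( f(\Sigma_X V_X^T S V_X \Sigma_X + \eta I^n) \right) \\
\text{s.t.} \quad & \operatorname{tr}(\eta C_i K_0 + C_i K_0 S K_0) \le b_i, \quad \forall 1 \le i \le m, \\
& S \succeq -\eta K_0^{-1}.
\end{aligned} \tag{4.7}$$

Note that the objective function is a strictly convex function of a linear transformation of $S$, and
hence is strictly convex in $S$. Furthermore, all the constraints are linear in $S$. As a result, problem
(4.7) is a convex program. Also, both $\Sigma_X$ and $V_X$ can be computed efficiently in $O(n^3)$ steps
using eigenvalue decomposition of $K_0 = X^T X$. Hence, problem (4.1) can be solved efficiently
in poly($n$, $m$) steps using standard convex optimization methods such as the ellipsoid method
[GLS88]. □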
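The two reductions used in the proof, the objective in terms of $\Sigma_X V_X^T S V_X \Sigma_X + \eta I^n$ and the kernelized linear constraint, can be checked numerically. The sketch below is illustrative only: the function names are assumptions, and $\Sigma_X$, $V_X$ are obtained from an eigendecomposition of $K_0$ as suggested in the text.

```python
import numpy as np

def spectral_apply(f, A):
    """Apply a scalar function f to the spectrum of a symmetric matrix A."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * f(vals)) @ vecs.T

def reduced_objective(f, K0, S, eta):
    """tr(f(Sigma V^T S V Sigma + eta I_n)), using only K0 = V Sigma^2 V^T."""
    vals, V = np.linalg.eigh(K0)
    Sigma = np.diag(np.sqrt(vals))
    M = Sigma @ V.T @ S @ V @ Sigma + eta * np.eye(K0.shape[0])
    return float(np.trace(spectral_apply(f, M)))

def reduced_constraint(C, K0, S, eta):
    """tr(eta C K0 + C K0 S K0), the kernelized form of tr(W X C X^T)."""
    return float(np.trace(eta * C @ K0 + C @ K0 @ S @ K0))
```

Both quantities depend on the data only through $K_0$, so they can be evaluated without forming the $d \times d$ matrix $W$.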



5 Special Cases 

In the previous section, we proved a general result on kernelization of metric learning. In this 
section, we further consider a few special cases of interest: the von Neumann divergence, the 
squared Frobenius norm and semi-definite programming. For each of the cases, we derive the 
required optimization problem to be solved and mention the relevant optimization algorithms that 
can be used. 



5.1 von Neumann Divergence 

The von Neumann divergence is a generalization of the well known KL-divergence to matrices.
It is used extensively in quantum computing to compare density matrices of two different
systems [NC00]. It is also used in the exponentiated matrix gradient method by [TRW05], online-PCA
method by [WK08] and fast SVD solver by [AK07]. The von Neumann divergence between $W$ and
$W_0$ is defined to be:

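$$D_{vN}(W, W_0) = \operatorname{tr}(W \log W - W \log W_0 - W + W_0),$$

where both $W$ and $W_0$ are positive definite. The metric learning problem that corresponds to (4.1)
is:

$$\begin{aligned}
\min_{W} \quad & D_{vN}(W, I) \\
\text{s.t.} \quad & \operatorname{tr}(W X C_i X^T) \le b_i, \quad \forall 1 \le i \le m, \\
& W \succeq 0.
\end{aligned} \tag{5.1}$$

It is easy to see that $D_{vN}(W, I) = \operatorname{tr}(f_{vN}(W))$, where

$$f_{vN}(W) = W \log W - W + I = U f_{vN}(\Lambda) U^T,$$

where $W = U \Lambda U^T$ is the eigenvalue decomposition of $W$ and $f_{vN} : \mathbb{R} \to \mathbb{R}$,
$f_{vN}(x) = x \log x - x + 1$. Also, note that $f_{vN}(x)$ is a strictly convex function with
$\operatorname{argmin}_x f_{vN}(x) = 1$ and $f_{vN}(1) = 0$. Hence, using Theorem 4.2, problem (5.1) is
kernelizable since $D_{vN}(W, I)$ satisfies the required conditions. Using (4.7), the optimization problem
to be solved is given by:

$$\begin{aligned}
\min_{S} \quad & D_{vN}(\Sigma_X V_X^T S V_X \Sigma_X + I^n, I^n) \\
\text{s.t.} \quad & \operatorname{tr}(C_i K_0 + C_i K_0 S K_0) \le b_i, \quad \forall 1 \le i \le m, \\
& S \succeq -K_0^{-1}.
\end{aligned} \tag{5.2}$$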
Next, we derive a simplified version of the above optimization problem.

Note that $D_{vN}(\cdot, \cdot)$ is defined only for positive semi-definite matrices. Hence, the constraint
$S \succeq -K_0^{-1}$ should be satisfied if the above problem is feasible. Thus, the reduced optimization
problem is given by:

$$\begin{aligned}
\min_{S} \quad & D_{vN}(\Sigma_X V_X^T S V_X \Sigma_X + I^n, I^n) \\
\text{s.t.} \quad & \operatorname{tr}(C_i K_0 + C_i K_0 S K_0) \le b_i, \quad \forall 1 \le i \le m.
\end{aligned} \tag{5.3}$$

Note that the von Neumann divergence is a Bregman matrix divergence (see Equation (3.4)) with
the generating function $\phi(X) = \operatorname{tr}(X \log X - X)$. Now using Lemma 3.2 and simplifying
using the fact that $\nabla_X \operatorname{tr}(X \log X - X) = \log X$, we get the following dual for
problem (5.1):

$$\begin{aligned}
\max_{\lambda} \quad & -\operatorname{tr}\left( \exp(-\Sigma_X V_X^T C(\lambda) V_X \Sigma_X) \right) - b(\lambda) \\
\text{s.t.} \quad & \lambda \ge 0,
\end{aligned} \tag{5.4}$$

where $C(\lambda) = \sum_i \lambda_i C_i$ and $b(\lambda) = \sum_i \lambda_i b_i$.

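Now, using $V_X \Sigma_X^2 V_X^T = K_0$ we see that:
$\operatorname{tr}((-\Sigma_X V_X^T C(\lambda) V_X \Sigma_X)^k) = \operatorname{tr}((-C(\lambda) K_0)^k)$. Next,
using the Taylor series expansion for the matrix exponential:

$$\begin{aligned}
\operatorname{tr}\left( \exp(-\Sigma_X V_X^T C(\lambda) V_X \Sigma_X) \right)
&= \operatorname{tr}\left( \sum_{i=0}^{\infty} \frac{(-\Sigma_X V_X^T C(\lambda) V_X \Sigma_X)^i}{i!} \right), \\
&= \sum_{i=0}^{\infty} \frac{\operatorname{tr}\left( (-\Sigma_X V_X^T C(\lambda) V_X \Sigma_X)^i \right)}{i!}, \\
&= \sum_{i=0}^{\infty} \frac{\operatorname{tr}\left( (-C(\lambda) K_0)^i \right)}{i!}
= \operatorname{tr}(\exp(-C(\lambda) K_0)).
\end{aligned}$$

Hence, the resulting dual problem is given by:

$$\begin{aligned}
\min_{\lambda} \quad & F(\lambda) = \operatorname{tr}(\exp(-C(\lambda) K_0)) + b(\lambda) \\
\text{s.t.} \quad & \lambda \ge 0.
\end{aligned} \tag{5.5}$$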

Also, $\frac{\partial F}{\partial \lambda_i} = -\operatorname{tr}(\exp(-C(\lambda) K_0) C_i K_0) + b_i$. Hence,
any first order smooth optimization method can be used to solve the above dual problem. Also, similar
to [KSD06], a Bregman's cyclic projection method can be used to solve the primal problem (5.3).
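A first-order method for (5.5) needs only $F(\lambda)$ and its gradient. The NumPy sketch below is one possible way to evaluate both. The function names are hypothetical, and the use of the symmetric form $\operatorname{tr}(\exp(-K_0^{1/2} C(\lambda) K_0^{1/2}))$, which has the same trace by the cyclic argument above, is an implementation choice made here so that only symmetric eigendecompositions are required.

```python
import numpy as np

def sqrt_psd(K):
    """Symmetric square root of a positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(K)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def dual_objective_and_grad(lam, Cs, b, K0):
    """F(lam) = tr(exp(-C(lam) K0)) + b(lam) and its gradient in lam.

    Uses tr(exp(-C K0)) = tr(exp(-K0^{1/2} C K0^{1/2})), so the matrix
    exponential is taken of a symmetric matrix via its eigendecomposition.
    """
    Kh = sqrt_psd(K0)
    C_lam = sum(l * C for l, C in zip(lam, Cs))
    vals, vecs = np.linalg.eigh(-Kh @ C_lam @ Kh)
    E = (vecs * np.exp(vals)) @ vecs.T           # exp(-K0^{1/2} C(lam) K0^{1/2})
    F = float(np.sum(np.exp(vals)) + np.dot(lam, b))
    grad = np.array([-np.trace(E @ Kh @ C @ Kh) for C in Cs]) + b
    return F, grad
```

The returned gradient can be passed to any projected first-order scheme that keeps $\lambda \ge 0$; the choice of such a scheme is left open here.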

5.2 Squared Frobenius Divergence

The squared Frobenius norm divergence is defined as:

$$D_{frob}(W, W_0) = \|W - W_0\|_F^2,$$

and is a popular measure of distance between matrices. Consider the following instance of (4.1)
with the squared Frobenius divergence as the objective function:

$$\begin{aligned}
\min_{W} \quad & D_{frob}(W, \eta I) \\
\text{s.t.} \quad & \operatorname{tr}(W X C_i X^T) \le b_i, \quad \forall 1 \le i \le m, \\
& W \succeq 0.
\end{aligned} \tag{5.6}$$

Note that for $\eta = 0$ and $C_i = (e_a - e_b)(e_a - e_b)^T - (e_a - e_c)(e_a - e_c)^T$ (relative distance
constraints), the above problem (5.6) is the same as the one proposed by [SSSN04]. Below we see that,
similar to [SSSN04], Theorem 4.2 in Section 4 guarantees kernelization for a more general class of
Frobenius divergence based objective functions.

It is easy to see that $D_{frob}(W, \eta I) = \operatorname{tr}(f_{frob}(W))$, where

$$f_{frob}(W) = (W - \eta I)^T (W - \eta I) = U f_{frob}(\Lambda) U^T,$$

$W = U \Lambda U^T$ is the eigenvalue decomposition of $W$ and $f_{frob} : \mathbb{R} \to \mathbb{R}$,
$f_{frob}(x) = (x - \eta)^2$. Note that $f_{frob}(x)$ is a strictly convex function with
$\operatorname{argmin}_x f_{frob}(x) = \eta$ and $f_{frob}(\eta) = 0$. Hence, using Theorem 4.2, problem (5.6)
is kernelizable since $D_{frob}(W, \eta I)$ satisfies the required conditions. Using (4.7), the optimization
problem to be solved is given by:

$$\begin{aligned}
\min_{S} \quad & \|\Sigma_X V_X^T S V_X \Sigma_X\|_F^2 \\
\text{s.t.} \quad & \operatorname{tr}(\eta C_i K_0 + C_i K_0 S K_0) \le b_i, \quad \forall 1 \le i \le m, \\
& S \succeq -\eta K_0^{-1}.
\end{aligned} \tag{5.7}$$

Also, note that $\|\Sigma_X V_X^T S V_X \Sigma_X\|_F^2 = \operatorname{tr}(K_0 S K_0 S)$. The above problem can
be solved using standard convex optimization techniques like interior point methods.
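The trace identity for the Frobenius objective follows from the cyclic property of the trace. As a worked step, using only $K_0 = V_X \Sigma_X^2 V_X^T$ and the symmetry of $S$ and $\Sigma_X$:

```latex
\begin{aligned}
\|\Sigma_X V_X^T S V_X \Sigma_X\|_F^2
  &= \operatorname{tr}\big((\Sigma_X V_X^T S V_X \Sigma_X)^T (\Sigma_X V_X^T S V_X \Sigma_X)\big) \\
  &= \operatorname{tr}\big(\Sigma_X V_X^T S \, V_X \Sigma_X^2 V_X^T \, S V_X \Sigma_X\big) \\
  &= \operatorname{tr}\big(V_X \Sigma_X^2 V_X^T \, S \, V_X \Sigma_X^2 V_X^T \, S\big) \\
  &= \operatorname{tr}(K_0 S K_0 S).
\end{aligned}
```

The objective of (5.7) can therefore be evaluated from the kernel matrix alone, without computing $\Sigma_X$ or $V_X$.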



5.3 SDPs 

In this section we consider the case when the objective function in (4.1) is a linear function. A
similar formulation for metric learning was proposed by [WBS05]. We consider the following generic
semidefinite program (SDP) to learn a linear transformation $W$:

$$\begin{aligned}
\min_{W} \quad & \operatorname{tr}(X C_0 X^T W) \\
\text{s.t.} \quad & \operatorname{tr}(W X C_i X^T) \le b_i, \quad \forall 1 \le i \le m, \\
& W \succeq 0.
\end{aligned} \tag{5.8}$$



Here we show that this problem can be efficiently solved for high dimensional data in its kernel 
space. 

Theorem 5.1. Problem (5.8) is kernelizable. 

Proof. (5.8) has a linear objective, i.e., it is a non-strict convex problem that may have multiple 
solutions. A variety of regularizations can be considered that lead to slightly different solutions. 
Here, we consider two regularizations: 

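• Frobenius norm: We add a squared Frobenius norm regularization to (5.8) so as to find
  the minimum Frobenius norm solution to (5.8) (when $\gamma$ is sufficiently small):

  $$\begin{aligned}
  \min_{W} \quad & \operatorname{tr}(X C_0 X^T W) + \frac{\gamma}{2} \|W\|_F^2 \\
  \text{s.t.} \quad & \operatorname{tr}(W X C_i X^T) \le b_i, \quad \forall 1 \le i \le m, \\
  & W \succeq 0.
  \end{aligned} \tag{5.9}$$

  Consider the following variational formulation of the problem:

  $$\begin{aligned}
  \min_{t} \min_{W} \quad & t + \frac{\gamma}{2} \|W\|_F^2 \\
  \text{s.t.} \quad & \operatorname{tr}(W X C_i X^T) \le b_i, \quad \forall 1 \le i \le m, \\
  & \operatorname{tr}(X C_0 X^T W) \le t, \\
  & W \succeq 0.
  \end{aligned} \tag{5.10}$$

  Note that for constant $t$, the inner minimization problem in the above problem is similar to
  (5.6) and hence can be kernelized. The corresponding optimization problem is given by:

  $$\begin{aligned}
  \min_{S, t} \quad & t + \frac{\gamma}{2} \operatorname{tr}(K_0 S K_0 S) \\
  \text{s.t.} \quad & \operatorname{tr}(C_i K_0 S K_0) \le b_i, \quad \forall 1 \le i \le m, \\
  & \operatorname{tr}(C_0 K_0 S K_0) \le t, \\
  & S \succeq 0.
  \end{aligned} \tag{5.11}$$

  Similar to (5.7), the above problem can be solved using convex optimization methods.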

• Log determinant: In this case we seek the solution to (5.8) with minimum determinant.
  To this effect, we add a log-determinant regularization:

  $$\begin{aligned}
  \min_{W} \quad & \operatorname{tr}(X C_0 X^T W) - \gamma \log\det W \\
  \text{s.t.} \quad & \operatorname{tr}(W X C_i X^T) \le b_i, \quad \forall 1 \le i \le m, \\
  & W \succeq 0.
  \end{aligned} \tag{5.12}$$

  The above regularization was also considered by [KSD09], which provided a fast projection
  algorithm for the case when each $C_i$ is a rank-one matrix and discussed conditions for which
  the optimal solution to the regularized problem is an optimal solution to the original SDP.

  Consider the following variational formulation of (5.12):

  $$\begin{aligned}
  \min_{t} \min_{W} \quad & t - \gamma \log\det W \\
  \text{s.t.} \quad & \operatorname{tr}(W X C_i X^T) \le b_i, \quad \forall 1 \le i \le m, \\
  & \operatorname{tr}(X C_0 X^T W) \le t, \\
  & W \succeq 0.
  \end{aligned} \tag{5.13}$$

  Note that the objective function of the inner optimization problem of (5.13) satisfies the
  conditions of Theorem 4.2, and hence (5.13) or equivalently (5.12) is kernelizable.

□ 



6 Experimental Results 

In Section 3, we presented metric learning as a constrained LogDet optimization problem to learn a
linear transformation, and we showed that the problem can be efficiently kernelized. Kernelization
yields two fundamental advantages over standard non-kernelized metric learning. First, a non-linear
kernel can be used to learn non-linear decision boundaries common in applications such as
image analysis. Second, in Section 3.6, we showed that the kernelized problem can be learned with
respect to a reduced basis of size $k$, admitting a learned kernel parameterized by $O(k^2)$ values.
When the number of training examples $n$ is large, this represents a substantial improvement over
optimizing over the entire $O(n^2)$ kernel matrix, both in terms of computational efficiency as well
as statistical robustness.

Figure 1: Results over benchmark UCI data sets. LogDet metric learning was run in input
space (LogDet Linear) as well as in kernel space with a Gaussian kernel (LogDet Gaussian).

In this section, we present experiments from two domains: text analysis and image processing.
As mentioned, image data sets tend to have highly non-linear decision boundaries. To this end, we
learn a kernel matrix when the baseline kernel $K_0$ is the pyramid match kernel, a method specifically
designed for object/image recognition [GD05]. In contrast, text data sets tend to perform quite
well with linear models, and the text experiments presented here have large training sets. We show
that high quality metrics can be learned using a relatively small set of basis vectors.

We evaluate performance of our learned distance metrics in the context of classification accuracy
for the $k$-nearest neighbor algorithm. Our $k$-nearest neighbor classifier uses $k = 10$ nearest neighbors
(except for Section 6.2 where we use $k = 1$), breaking ties arbitrarily. We select the value of $k$
arbitrarily and expect to get slightly better accuracies using cross-validation. Accuracy is defined
as the number of correctly classified examples divided by the total number of classified examples.
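The evaluation protocol can be sketched as follows, assuming that distances are derived from a (learned or baseline) kernel through the standard identity $d(x, y) = K(x, x) + K(y, y) - 2K(x, y)$. The function names and the majority-vote tie handling are illustrative assumptions, not the code used for the reported experiments.

```python
import numpy as np

def kernel_sq_distances(K_test_train, k_test_diag, k_train_diag):
    """d(x, y) = K(x, x) + K(y, y) - 2 K(x, y) for every test/train pair."""
    return k_test_diag[:, None] + k_train_diag[None, :] - 2.0 * K_test_train

def knn_predict(dist, train_labels, k=10):
    """Majority vote over the k nearest training points (ties broken arbitrarily)."""
    nearest = np.argsort(dist, axis=1)[:, :k]
    preds = []
    for row in nearest:
        labels, counts = np.unique(train_labels[row], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def accuracy(pred, truth):
    """Fraction of correctly classified examples."""
    return float(np.mean(pred == truth))
```

Test points are compared only against training points, so `K_test_train` has one row per test example and one column per training example.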

For our proposed algorithms, pairwise constraints are inferred from true class labels. For each
class $i$, 100 pairs of points are randomly chosen from within class $i$ and are constrained to be similar,
and 100 pairs of points are drawn from classes other than $i$ to form dissimilarity constraints. Given
$c$ classes, this results in $100c$ similarity constraints, and $100c$ dissimilarity constraints, for a total
of $200c$ constraints. The upper and lower bounds for the similarity and dissimilarity constraints
are determined empirically as the 1st and 99th percentiles of the distribution of distances computed
using a baseline Mahalanobis distance parameterized by $W_0$. Finally, the slack penalty parameter
$\gamma$ used by our algorithms is cross-validated using values {.01, .1, 1, 10, 100, 1000}.
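The constraint-generation procedure can be sketched as below. Several details are assumptions made for illustration: each dissimilar pair is formed from one point inside class $i$ and one point outside it (the text leaves the exact pairing open), the percentiles are estimated from randomly sampled pairs, data points are stored as rows, and all function names are hypothetical.

```python
import numpy as np

def sample_constraints(labels, pairs_per_class=100, seed=0):
    """Return lists of similar and dissimilar index pairs inferred from labels."""
    rng = np.random.default_rng(seed)
    similar, dissimilar = [], []
    for c in np.unique(labels):
        inside = np.flatnonzero(labels == c)     # needs at least two points per class
        outside = np.flatnonzero(labels != c)
        for _ in range(pairs_per_class):
            i, j = rng.choice(inside, size=2, replace=False)
            similar.append((int(i), int(j)))
            # Assumption: one endpoint in class c, the other in a different class.
            dissimilar.append((int(rng.choice(inside)), int(rng.choice(outside))))
    return similar, dissimilar

def baseline_bounds(X, W0, n_pairs=1000, seed=0):
    """1st and 99th percentiles of baseline distances over random pairs (rows of X)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    i = rng.integers(0, n, n_pairs)
    j = rng.integers(0, n, n_pairs)
    diff = X[i] - X[j]
    dist = np.einsum("pd,de,pe->p", diff, W0, diff)
    u, l = np.percentile(dist, [1, 99])
    return float(u), float(l)
```

With $c$ classes and the default setting this yields $100c$ similar and $100c$ dissimilar pairs, matching the $200c$ total stated above.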

All metrics are trained using data only in the training set. Test instances are drawn from the 
test set and are compared to examples in the training set using the learned distance function. The 
test and training sets are established using a standard two-fold cross validation approach. For 
experiments in which a baseline distance metric is evaluated (for example, the squared Euclidean 
distance), nearest neighbor searches are again computed from test instances to only those instances 
in the training set. 
6.1 Low-Dimensional Data Sets



First we evaluate our metric learning method on the standard UCI datasets in the low-dimensional
(non-kernelized) setting, to directly compare with several existing metric learning methods. In
Figure 1, we compare LogDet Linear ($K_0$ equals the linear kernel) and the LogDet Gaussian ($K_0$
equals Gaussian kernel in kernel space) algorithms against existing metric learning methods for
$k$-NN classification. We use the squared Euclidean distance, $d(x, y) = (x - y)^T (x - y)$, as a baseline
method. We also use a Mahalanobis distance parameterized by the inverse of the sample covariance
matrix. This method is equivalent to first performing a standard PCA whitening transform over
the feature space and then computing distances using the squared Euclidean distance. We compare
our method to two recently proposed algorithms: Maximally Collapsing Metric Learning [GR05]
(MCML), and metric learning via Large Margin Nearest Neighbor [WBS05] (LMNN). Consistent
with existing work such as [GR05], we found the method of [XNJR02] to be very slow and inaccurate,
so the latter was not included in our experiments. As seen in Figure 1, LogDet Linear and LogDet
Gaussian algorithms obtain somewhat higher accuracy for most of the datasets.
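The stated equivalence between the inverse-covariance Mahalanobis baseline and PCA whitening followed by the squared Euclidean distance can be illustrated with a short NumPy sketch. Data points are stored as rows here, and the function name is an assumption.

```python
import numpy as np

def whiten(X):
    """PCA whitening of row-wise data X (n x d); also returns the sample covariance."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    return Xc @ vecs / np.sqrt(vals), cov
```

For any two points, the squared Euclidean distance between their whitened versions should equal $(x - y)^T \Sigma^{-1} (x - y)$, where $\Sigma$ is the sample covariance, provided $\Sigma$ is non-singular.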




(a) Clarify Datasets (b) Latex

Figure 2: Classification error rates for $k$-nearest neighbor software support via different learned
metrics. We see in figure (a) that LogDet Linear is the only algorithm to be optimal (within the
95% confidence intervals) across all datasets. LogDet is also robust at learning metrics over higher
dimensions. In (b), we see that the error rate for the Latex dataset stays relatively constant for
LogDet Linear.

In addition to our evaluations on standard UCI datasets, we also apply our algorithm to the
recently proposed problem of nearest neighbor software support for the Clarify system [HRD+07].
The basis of the Clarify system lies in the fact that modern software design promotes modularity and
abstraction. When a program terminates abnormally, it is often unclear which component should be
responsible for (or is capable of) providing an error report. The system works by monitoring a set of
predefined program features (the datasets presented use function counts) during program runtime
which are then used by a classifier in the event of abnormal program termination. Nearest neighbor
searches are particularly relevant to this problem. Ideally, the neighbors returned should not only
have the correct class label, but should also represent those with similar program configurations
or program inputs. Such a matching can be a powerful tool to help users diagnose the root cause
of their problem. The four datasets we use correspond to the following software programs: Latex (the
document compiler, 9 classes), Mpg321 (an mp3 player, 4 classes), Foxpro (a database manager, 4
classes), and Iptables (a Linux kernel application, 5 classes).

Table 1: Training time (in seconds) for the results presented in Figure 2(a).

Dataset     LogDet Linear   MCML     LMNN
Latex       0.0517          19.8     0.538
Mpg321      0.0808          0.460    0.253
Foxpro      0.0793          0.152    0.189
Iptables    0.149           0.0838   4.19

Table 2: Unsupervised $k$-means clustering error using the baseline squared Euclidean distance,
along with semi-supervised clustering error with 50 constraints.

Dataset      Unsupervised   LogDet Linear   HMRF-KMeans
Ionosphere   0.314          0.113           0.256
Digits-389   0.226          0.175           0.286

Our experiments on the Clarify system, like the UCI data, are over fairly low-dimensional data.
It was shown [HRD+07] that high classification accuracy can be obtained by using a relatively small
subset of available features. Thus, for each dataset, we use a standard information gain feature
selection test to obtain a reduced feature set of size 20. From this, we learn metrics for $k$-NN
classification using the methods developed in this paper. Results are given in Figure 2(a). The
LogDet Linear algorithm yields significant gains for the Latex benchmark. Note that for datasets
where Euclidean distance performs better than using the inverse covariance metric, the LogDet
Linear algorithm that normalizes to the standard Euclidean distance yields higher accuracy than
that regularized to the inverse covariance matrix (LogDet-Inverse Covariance). In general, for the
Mpg321, Foxpro, and Iptables datasets, learned metrics yield only marginal gains over the baseline
Euclidean distance measure.

Figure 2(b) shows the error rate for the Latex dataset with a varying number of features (the
feature sets are again chosen using the information gain criterion). We see here that LogDet Linear
is surprisingly robust. Euclidean distance, MCML, and LMNN all achieve their best error rates for
five dimensions. LogDet Linear, however, attains its lowest error rate of 0.15 at $d = 20$ dimensions.

In Table 1, we see that LogDet Linear generally learns metrics significantly faster than other 
metric learning algorithms. The implementations for MCML and LMNN were obtained from their 
respective authors. The timing tests were run on a dual processor 3.2 GHz Intel Xeon processor 
running Ubuntu Linux. Time given is in seconds and represents the average over 5 runs. 

We also present some semi-supervised clustering results for two of the UCI data sets. Note
that both MCML and LMNN are not amenable to optimization subject to pairwise distance
constraints. Instead, we compare our method to the semi-supervised clustering algorithm
HMRF-KMeans [BBM04]. We use a standard 2-fold cross validation approach for evaluating
semi-supervised clustering results. Distances are constrained to be either similar or dissimilar, based
on class values, and are drawn only from the training set. The entire dataset is then clustered into $c$
clusters using $k$-means (where $c$ is the number of classes) and error is computed using only the test
set. Table 2 provides results for the baseline $k$-means error, as well as semi-supervised clustering
results with 50 constraints.
Figure 3: Caltech-101: Comparison of LogDet based metric learning method with other
state-of-the-art object recognition methods. Our method outperforms all other single metric/kernel
approaches. ML+SUM refers to our learned kernel when the average of four kernels (PMK [GD05],
SPMK [LSP06], Geoblur-1, Geoblur-2 [BM01]) is the base kernel, ML+PMK refers to the learned
kernel over the pyramid match [GD05] as the base kernel, and ML+CORR refers to the learned
kernel when the correspondence kernel of [ZBMM06] is the base kernel.

6.2 Metric Learning for Object Recognition 

Next we evaluate our method over high-dimensional data applied to the object-recognition task
using Caltech-101 [Cal04], a common benchmark for this task. The goal is to predict the category
of the object in the given image using a $k$-NN classifier.

We compute distances between images using learned kernels with three different base image
kernels: 1) PMK: Grauman and Darrell's pyramid match kernel [GD05] applied to SIFT features,
2) CORR: the kernel designed by [ZBMM06] applied to geometric blur features, and 3) SUM:
the average of four image kernels, namely, PMK [GD05], Spatial PMK [LSP06], Geoblur-1, and
Geoblur-2 [BM01]. Note that the underlying dimensionality of these embeddings is typically in
the millions.

We evaluate the effectiveness of metric/kernel learning on this dataset. We pose a $k$-NN
classification task, and evaluate both the original (SUM, PMK or CORR) and learned kernels. We
set $k = 1$ for our experiments; this value was chosen arbitrarily. We vary the number of training
examples $T$ per class for the database, using the remainder as test examples, and measure accuracy
in terms of the mean recognition rate per class, as is standard practice for this dataset.

Figure 4: Object recognition on the Caltech-101 dataset. Our learned kernels significantly improve
NN recognition accuracy relative to their non-learned counterparts, the SUM (average of four
kernels), the CORR and PMK kernels.

Figure 3 shows our results relative to several other existing techniques that have been applied to 
this dataset. Our approach outperforms all existing single-kernel classifier methods when using the 
learned CORR kernel: we achieve 61.0% accuracy for T = 15 and 69.6% accuracy for T = 30. Our 
learned PMK achieves 52.2% accuracy for T = 15 and 62.1% accuracy for T = 30. Similarly, our 
learned SUM kernel achieves 73.7% accuracy for T = 15. Figure 4 specifically shows the comparison 
of the original baseline kernels for NN classification. The plot reveals gains in 1-NN classification 
accuracy; notably, our learned kernels with simple NN classification also outperform the baseline 
kernels when used with SVMs [ZBMM06, GD05]. 

6.3 Metric Learning for Text Classification 

Next we present results in the text domain. Our text datasets are created by standard bag-of-words 
Tf-Idf representations. Words are stemmed using a standard Porter stemmer and common stop 
words are removed, and the text models are limited to the 5,000 words with the largest document 
frequency counts. We provide experiments for two data sets: CMU Newsgroups [CMU08], and 
Classic3 [Cla08]. Classic3 is a relatively small 3 class problem with 3,891 instances. The newsgroup 
data set is much larger, having 20 different classes from various newsgroup categories and 20,000 
instances. 

As mentioned earlier, our text experiments use a linear kernel, and we use a set of basis vectors
that is constructed from the class labels via the following procedure. Let $c$ be the number of distinct
classes and let $k$ be the size of the desired basis. If $k = c$, then each class mean $r_i$ is computed
to form the basis $R = [r_1 \ldots r_c]$. If $k < c$ a similar process is used but restricted to a randomly
selected subset of $k$ classes. If $k > c$, instances within each class are clustered into approximately
$k/c$ clusters. Each cluster's mean vector is then computed to form the set of low-rank basis vectors
$R$.

Figure 5: Classification accuracy for our Mahalanobis metrics learned over basis of different
dimensionality. Overall, our method (LogDet Linear) significantly outperforms existing methods.
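The class-mean construction for the cases $k = c$ and $k < c$ can be sketched as follows. The function name is hypothetical, data points are stored as columns of $X$ as elsewhere in the paper, and the $k > c$ case is deliberately left out because it needs a within-class clustering routine whose details are not specified here.

```python
import numpy as np

def class_mean_basis(X, labels, k, seed=0):
    """Basis R (d x k) built from class means, for k <= number of classes."""
    classes = np.unique(labels)
    if k > len(classes):
        raise ValueError("k > c needs a within-class clustering step, not sketched here")
    rng = np.random.default_rng(seed)
    if k == len(classes):
        chosen = classes
    else:
        chosen = rng.choice(classes, size=k, replace=False)   # random subset of k classes
    return np.column_stack([X[:, labels == c].mean(axis=1) for c in chosen])
```

The resulting $R$ can then be orthonormalized as $U = R(R^T R)^{-1/2}$, as described in Section 3.6.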

Figure 5 shows classification accuracy across bases of varying sizes for the Classic3 dataset, 
along with the newsgroup data set. As baseline measures, the standard squared Euclidean distance 
is shown, along with Latent Semantic Analysis (LSA) [DDL+90], which works by projecting the
data via principal components analysis (PCA), and computing distances in this projected space. 
Comparing our algorithm to the baseline Euclidean measure, we can see that for smaller bases, the 
accuracy of our algorithm is similar to the Euclidean measure. As the size of the basis increases, 
our method obtains significantly higher accuracy compared to the baseline Euclidean measure. 



7 Conclusions 

In this paper, we have considered the general problem of learning a linear transformation of input
data and applied it to the problem of implicitly learning a metric over high-dimensional data or
feature space, i.e., a function of the form $\phi(x_i)^T A \phi(x_j)$. We first showed that the LogDet
divergence is a useful loss for learning a linear transformation (or performing metric learning) in
kernel space, as the algorithm can easily be generalized to work in kernel space. We then proposed an
algorithm based on Bregman projections to learn a kernel function over the data-points efficiently. We
also showed that our learned metric can be restricted to a small dimensional basis efficiently, hence
scaling our method to large datasets with high-dimensional feature space. Then we considered a larger class of convex loss
functions for learning the metric/kernel using a linear transformation of the data; we saw that 
many loss functions can lead to kernelization, though the resulting optimizations may be more 
expensive to solve than the simpler LogDet formulation. Finally, we presented some experiments 
on benchmark data, high-dimensional vision, and text classification problems, demonstrating our 
method compared to several existing state-of-the-art techniques. 

There are several directions of future work. To facilitate even larger data sets than the ones 
considered in this paper, online learning methods are one promising research direction; in [JKDG08], 
an online learning algorithm was proposed based on LogDet regularization, and this remains a part 
of our ongoing efforts. Recently, there has been some interest in learning multiple local metrics 
over the data; [WS08] considered this problem. We plan to explore this setting with the LogDet 
divergence, with a focus on scalability to very large data sets. 